Apply PDP to the regression example of predicting bike rentals. Fit a random forest approximation for the prediction of bike rentals (cnt). Use the partial dependence plot to visualize the relationships the model learned. Use the slides shown in class as model.
Analyse the influence of days since 2011, temperature, humidity and wind speed on the predicted bike counts.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(reshape2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(randomForestSRC)
## Warning: package 'randomForestSRC' was built under R version 4.1.3
##
## randomForestSRC 3.1.0
##
## Type rfsrc.news() to see new features, changes, and bug fixes.
##
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(ggplot2)
#setwd("/Users/cmonserr/OneDrive - UPV/Trabajo_2/Asignaturas/Evaluacion de modelos/Practicas/Practica 3/Bike-Sharing-Dataset")
days <- read.csv("day.csv")
hour <- read.csv("hour.csv")
days$dteday <- as_date(days$dteday)
days_since <- select(days, workingday, holiday, temp, hum, windspeed, cnt)
days_since$days_since_2011 <- int_length(interval(ymd("2011-01-01"), days$dteday)) / (3600*24)
days_since$SUMMER <- ifelse(days$season == 3, 1, 0)
days_since$FALL <- ifelse(days$season == 4, 1, 0)
days_since$WINTER <- ifelse(days$season == 1, 1, 0)
days_since$MISTY <- ifelse(days$weathersit == 2, 1, 0)
days_since$RAIN <- ifelse(days$weathersit == 3 | days$weathersit == 4, 1, 0)
days_since$temp <- days_since$temp * 47 - 8
days_since$hum <- days_since$hum * 100
days_since$windspeed <- days_since$windspeed * 67
rf <- rfsrc(cnt~., data=days_since, importance=TRUE)
results <- select(days_since, days_since_2011, temp, hum, windspeed, cnt)
nr <- nrow(days_since)
for(c in names(results)[1:4])
{
for(i in 1:nr){
r <- days_since
r[[c]] <- days_since[[c]][i]
sal <- predict(rf, r)$predicted
results[[c]][i] <- sum(sal) / nr
}
}
p1 <- ggplot(data = days_since, aes(x=days_since_2011, y=results$days_since_2011)) + geom_line() +
ylim(0, 6000) + geom_rug(alpha=0.1, sides="b")+ labs(x="Days since 2011", y="Prediction")
p2 <- ggplot(data = days_since, aes(x=temp, y=results$temp)) + geom_line() + ylim(0, 6000) +
geom_rug(alpha=0.1, sides="b") + labs(x="Temperature", y=NULL)
p3 <- ggplot(days_since, aes(x=hum , y = results$hum)) + geom_line() + geom_rug(alpha=0.1, sides="b") +
ylim(0, 6000) + labs(x="Humidity", y=NULL)
p4 <- ggplot(data = days_since, aes(x=windspeed, y=results$windspeed)) + geom_line() + ylim(0, 6000) +
geom_rug(alpha=0.1, sides="b") + labs(x="Wind speed", y=NULL)
p <- subplot(p1,p2,p3,p4, shareY = T, titleY=T, titleX=T)
p
Para el caso de Días desde 2011, se observa cierta correlación positiva entre la predicción de las bicicletas alquiladas y los días a partir del 2011. Por lo que, en general, cuantos más días pasan más aumenta el número de bicicletas alquiladas. Aunque, no se observa una fuerte correlación positiva ya que en algunos momentos decrece ligeramente el número de bicicletas alquiladas. Mirando la gráfica, vemos que al principio aumenta, se observa un pequeño parón, vuelve a aumentar y por último un ligero decrecimiento.
Para la temperatura, se observa también una correlación positiva para la variable temperatura y la predicción de las bicicletas alquiladas. En general, para mayores temperaturas el número de bicicletas alquiladas es mayor. Excepto para las temperaturas superiores a 23 º, esto parece lógico ya que cuando hace demasiado calor las personas no tienen tantas ganas de ir en bicicleta.
Si la humedad es menor de 50, el alquiler de bicicletas no varía, por lo que no influye. Sin embargo, a partir 50 observamos que existe una relación negativa entre estas dos variables, según aumenta la humedad disminuye el número de bicicletas alquiladas.Además, cuando la humedad es menor de 25, apenas hay muestras, por lo que no podemos sacar ninguna conclusión en ese rango.
Por último, respecto a la velocidad del viento, cuando es superior a 25, el número de muestras es reducido, por lo que, hasta ese momento, se puede apreciar una correlación positiva entre la velocidad del viento y el alquiler de bicicletas.
Generate a 2D Partial Dependency Plot with humidity and temperature to predict the number of bikes rented depending of those parameters.
BE CAREFUL: due to the size, extract a set of random samples from the BBDD before generating the the data for the Partial Dependency Plot.
Show the density distribution of both input features with the 2D plot as shown in the class slides.
TIP: Use geom_tile() to generate the 2D plot. Set width and height to avoid holes.
Interpret the results.
sampled <- sample_n(days_since, 40)
temp <- sampled$temp
hum <- sampled$hum
th <- inner_join(data.frame(temp),data.frame(hum), by=character())
th$p <- 0
for(i in 1:nrow(th)){
r <- days_since
r[["temp"]] <- th[["temp"]][i]
r[["hum"]] <- th[["hum"]][i]
sal <- predict(rf, r)$predicted
th[["p"]][i] <- sum(sal) / nr
}
Apply the previous concepts to predict the price of a house from the database kc_house_data.csv. In this case, use again a random forest approximation for the prediction based on the features bedrooms, bathrooms, sqft_living, sqft_lot, floors and yr_built. Use the partial dependence plot to visualize the relationships the model learned.
BE CAREFUL: due to the size, extract a set of random samples from the BBDD before generating the data for the Partial Dependency Plot.
Analyse the influence of bedrooms, bathrooms, sqft_living and floors on the predicted price.
d <- read.csv("kc_house_data.csv")
sampled <- sample_n(d, 1000)
sampled <- select(sampled, bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, price)
rf <- rfsrc(price~., data=sampled)
results <- select(sampled, bedrooms, bathrooms, sqft_living, floors, price)
nr <- nrow(sampled)
for(c in names(results)[1:4]){
for(i in 1:nr){
r <- sampled
r[[c]] <- sampled[[c]][i]
sal <- predict(rf, r)$predicted
results[[c]][i] <- sum(sal) / nr
}
}